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Abstract 

Quantile regression has received increased attention in the statistics community in re- 
cent years. This article adapts an auxiliary variable method, commonly used in Bayesian 
variable selection for mean regression models, to the fitting of quantile regression curves. 
We focus on the fitting of regression splines, with unknown number and location of 
knots. We provide an efficient algorithm with Metropolis-Hastings updates whose tun- 
ing is fully automated. The method is tested on simulated and real examples and its 
extension to additive models is described. Finally we propose a simple postprocess- 
ing procedure to deal with the problem of the crossing of multiple separately estimated 
quantile curves. 

Keywords: Quantile regression; Curve fitting; Gibbs sampling; Splines; Additive mod- 
els; Automatic tuning; Noncrossing curves. 

1 Introduction 



Quantile regression has been recognized in recent years as a robust statistical procedure 
that offers a powerful alternative to the ordinary mean regression, especially when the 
data contains large outliers or when the response variable has a skewed or multimodal 
conditional distribution. Given a fixed probability p, < p < 1, let the model corre- 
sponding to the p-th quantile regression curve be given by 

Yi\xi,...,Xn ~ fp{xi) + ei, i = l,...,n 

where ei, e„ are independent draws from a noise distribution whose p-th quantile is 
0, i.e. P(e < 0) = p. Under this model the p-th quantile of the conditional distribution 
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of Y given {X = x} is given by some smooth function fp{x). If the distribution of the 
noise is left unspecified then the estimation of fp is typically carried out by solving the 
minimization problem, for a given class T of curves. 



n 

arg minS^Ppiyi 

h^^ i=i 



(1) 



where the so-called "check function" Pp{.) is given by Pp{e) = pe if e > and Pp{e) = [p — 
l)e otherwise (see lKoenker and Bassett 1978|l . To define a likelihood function, one usu- 
ally assumes that the noise distribution is an asymetric Laplace distribution so that the 
maximum likelihood estimate corresponds to the solution of the minimization problem 
(see IKoenker and Machado 1999ll. See e.g. |Yu et al. (2003)] or [Koenker (2005)| for a review 
on quantile regression and Geraci and Bottai (2007) for quantile regression with longitu- 



dinal data. References on Bayesian treatments of the subject include Tsionas (2003) for 



inference on a single quantile, Yu and Moyeed (2001) for quantile regression with a ran- 
dom walk Metropolis-Hastings algorithm and Yu (2002) for quantile regression with a re- 



versible jump MCMC sampler (RIMCMC. lGreen 1995ll . More recentlyfVue and Rue (2011) 



considers additive mixed regression models and inference with either MCMC sampling 

or the integrated nested Laplace approximation (INLA. lRue et al. 2009)) and Kozumi and Kobayashi (2011) 

proposed quantile regression with a Gibbs sampler. 

In this article, we are interested in the case where the curve fp is modeled with spline 
functions of a given degree, P > 1, so that. 



K 



fp{x) = ao + ^ ajx^ + ^ r]k{x - -fkY- 



(2) 



k=l 



where z+ = max(0, z) and where 7fc, /c = 1, . . . ,K represent the locations of K knot 
points (see lHastie and Tibshirani 1990|) . Typically, the degree P is set to equal 3, since cu- 



bic splines are known to approximate locally smooth functions arbitrarily well. Chen and Yu (2009) 



provides a Bayesian inference on this model, where the number of knots and their loca- 
tion are automatically selected. Their method relies on a RJMCMC algorithm which, un- 
der the prior specifications they use, needs to compute an approximation of the ratio of 



marginal likelihoods. For fitting of quantile smoothing splines see Koenker et al. (1994) 



and He and Ng (1999) and for a Bayesian inference with natural cubic splines see Thompson et al. (2010) 

We propose here an alternative strategy that avoids the use of the RJMCMC sam- 
pler which can often be difficult to tune (see IFan and Sisson 20111 for a review) and that 
does not rely on approximations to simplify computations. Recognising that a Bayesian 



variable selection technique (e.g. George and McCuUoch 1993 1 can be used for infer- 
ence on a curve (e.g. ISmith and Kohn 1996llFan et al. 20101) we use an auxiliary variable 
approach which makes possible, under appropriate prior specifications, a Metropolis- 
Hastings within Gibbs sampler. The proposed MCMC sampler is easy to implement and 
fully automated. In particular it incorporates an algorithm which automatically tunes 
the scaling parameters used in our Random-walk Metropolis-Hastings algorithm. 

In Section|2]we present the model and the prior specifications, then we describe how 
inference is carried out with a MCMC sampler. We apply the method on several datasets 
in Section |3] In Section |4] we consider quantile curve regression for additive models. 
Finally, in Section |5l we discuss the problem of crossing quantile curves and propose a 
simple postprocessing procedure to reweight the MCMC samples from separately esti- 
mated quantile curves. 
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2 Quantile regression with splines 



For some < p < 1, and given paired observations (xi, yi), . . . , we are inter- 

ested in fitting the p-th quantile regression model 

Yi\xi,...,Xn ~ fp{xi) + €i, i = l,...,n (3) 

where ei, e„ are independent draws from the asymetric Laplace distribution 



'^ALp{0,<T) (e) 



p{l-p) 



a 



exp 



:Pp(e)} 



(4) 



for an unknown scale parameter a > 0. Under this model the p-th quantile of the 
conditional distribution of Y given {X = x} is fp{x). The asymetric Laplace distri- 



bution has been adopted in many papers, see for example |Koenker and Machado (1999) 



Yu and Moyeed (2001)t[Tsionas (2003)t|Chen and Yu (2009)trYue and Rue (2011)| or [Kbzumi and Kobayash 



Under the asymmetric Laplace distribution, given a, the function fp maximizing the 
likelihood corresponding to model ^ is also the solution of the minimization problem 
in Equation ((T]). The scale parameter a that takes into account the variability of the ob- 
servations is considered as a nuisance parameter. 

We consider hereafter that the curve fp is modeled with spline functions of a given 
degree -P > 0, in the form of Equation Q. Under this representation, fitting the curve 
consists of estimating the number of knots K, the knot locations 7A; , = 1, . . . ,K, and 
the corresponding regression coefficients aj, j = 0, . . . ,P and r]k, k = 1, . . . , K. If j^, k = 
1, ...,Kmax, where K^ax represents the (known) maximum number of potential knots, 
model (O can be written as the linear model 



Y = X^/3 + e 



(5) 



where Y = (yi, . . . ,y„)', (3 = (ao,ai, . . . , ap,r]i, . . .,mmaJ'' e = (ei, • • • ,en)', with de- 
sign matrix 



X^ 



(l„,x, . . . ,x^, (x - l„7i):^, . . . , (x - 1„7K„„J^) 



(6) 



where x = (xi, . . . , x„)' and where 1„ = (1, . . . , 1)' denotes the unit vector of size n. 



2.1 The model and prior assumptions 

We adopt an auxiliary variable approach for the spline regression model by introducing 
a vector of binary indicator variables z^, = 1, . . . , Kmax, 

_ J 1 if there is a knot point 7^ in the interval 1^ and ry^ ^ 
I if there is no knot point in the interval and % = 

where 77^ denotes the spline coefficients in model and the intervals 1^ are defined on 
the range of the xi's. Each interval Ik contains at most one knot with unknown location 
7fc. In practice, such intervals can be defined by either using prior information on regions 
where a knot is suspected or, in the absence of such prior information, an equal partition 
of the range may be adopted. We denote the vector (71, ... , 7/<„„^)' by 7 and consider 
the Uniform distributions on the interval as the prior distribution on 7. Each possible 
value for 7 gives a model of the form fSj). Let ,y denotes the matrix constructed with 
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the columns of corresponding to non-zero entries in z, and let /S^ ,y denotes the vector 
of corresponding regression coefficients. 

A desirable feature of the asymetric Laplace distribution is that it can be decomposed 
as a scale mixture of normals (see e.g. ITsionas 2003llYue and Rue 2011l or Kozumi and Kobayashi 2011 1 

'1 — 2p)w 2aw 



w ~ £xp{l/a), 

where £xp{l/a) denotes the exponential distribution with mean a. If Wi, i = 1, . . . ,n 
denote the variable w associated with each gj, the conditional distribution of Y given W, 
the diagonal matrix with entries Wi, i = 1, ... ,n, is 

f{Y\X,,„l3,,„z,a,^,W) = (x,,,/3,,, + g^m., ^^^W-) . (7) 

Conditional on W we use the following decomposition of the joint prior distribution of 
the unknown parameters 

'n-{Pz,-y,z,a,-f\W) = 7r/3^_^ (/?^,^ 1 2:,cr, 7, VF)7r^(cj)7r2(2;)7r^ (7), 

where we set 



(8) 



This conditional prior for (3z,'y, related to 5-priors (|Zellner 1986|l , has the advantage of 
conjugacy in the case of normal errors, in which case the regression and variance param- 
eters can be analytically integrated out. 

Different choices for the parameter c have been proposed in the literature for mean 
regression problems. The case c = n, where n is the sample size, corresponds to the 
unit information prior which was used by [DiMatteo et al. (2001) a default choice that 
works well in practice in Bayesian variable selection problems with large sample sizes. 



Smith and Kohn (1996) recommend values of c in the range 10 < c < 1000 for the prob- 



lems they considered. Here including an adaptive scale parameter c, and treating it as 
another parameter was more satisfactory than using a fixed one. Thus we include a 
hyper-prior for c, following e.g. [Leslie et al. (2007)} we use a diffuse prior IQ{1, 2n) with 
a mode at 7i 



7r(c) oc c ^exp{— 2n/c}. 



See Liang et al. (2008) for more discussion about the choice of a prior distribution on the 
parameter c. 

For the variance parameter, we use the standard uninformative prior iTaicr) oc 1/a. 
Finally, we need to define the prior distribution for z, we consider the decomposition of 
this prior given by 

■^z{z) = 7r{z I |z|)7r(|2:|) 

where \z\ = Ylk=i^ is the number of non-zero entries in z, i.e. the number of knots 
that are used in the corresponding model. We use for this term a Poisson distribution 
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with mean A that is right-truncated at a specified maximum value, L. We assume also 
that, given this quantity, all possible configurations for z have equal probabilities, so that 

T^z{z) OC ■^I{\z\<L}- 

The parameters arid a can be integrated out of the full joint posterior distribution 

7r{f3z,'y, z, (7,7, W, c\Y) and we get 

\ — 3n/2 

where 
and where 

Details of the marginal posterior are given in Appendix A. 

2.2 Inference on the posterior distribution 

An MCMC sampler is used for the inference on the model. Based on the posterior distri- 
bution (|9]), for each t*^ iteration of the MCMC update, i = 1, . . . , T, perform the following 
successive updates for z, 7, W and c: 

• Update z. This update involves two types of moves; with probability 0.5 we pro- 
pose an add/delete step, otherwise a swap step is proposed. Specifically, the two 
move steps involve 

- add/ delete: randomly select a Zk and propose to change its value; 

- swap: randomly select two values Zi and zj, and propose to exchange their 
values. 

In both cases, proposed moves from current value z to proposed value z' are ac- 
cepted with the usual Metropolis-Hastings acceptance probability 

. . /. 7T{z',^,W,c\Y)q{z',z) ] 
a{z, z ) = mm <^ 1, — ... , jr \ 

where q(z, z') is the probability of proposing the new value z' given the current 
value z. 

• Update 7. For each k = 1, . . . , K^ax, we differentiate the cases when Zf^ = and 
when Zfc = 1: 

- if Zfc = then 7/^. is updated according to its prior distribution, i.e. a Uniform 
distribution on 7^; 
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- if Zk = 1, 7fe is updated to a new value 7^,, according to the posterior distribu- 
tion 

fpil-p) " 1"^"^' 

7r{-fkhj^k,z,Y,W,c) oc I Sz,-y,w,cO^) + ^7(7)- 

An independence Metropolis-Hastings step can be used for this last type of updat- 
ing, using the prior on 7^ as a proposal, with the corresponding acceptance proba- 
bility given by 

a(7fe,7fc) =min<^ 1, , , TFTTT^ { ■ 

Update W. Each Wi,i = 1, . . . ,n has conditional posterior distribution 

1 fp(i-p) " r'"^' 

7r(?i;i|'u;j^j7,z,c,y) oc ™ —< S;,^-y^w,c(Y) + ^^1} 

We use a Random- walk Metropolis-Hastings proposal to update each Wi. We con- 
sider as proposal distribution a normal distribution q{wi, .) = N{wi,af) with mean 
and variance af. We sample u;- ~ q{wi, .), then the proposed value w'^ is accepted 
with probability 



a{wi,w'j) = min 



^ 7r(w-|wj^j7,z,c,y) ] 



The tuning parameters (Tf,i = 1 , . . . , n are optimally obtained automatically, prior 
to starting the main part of MCMC, see Appendix B. 

Update c. The parameter c has conditional distribution 



(C+ l){kl+^'+l)/2 



j=l 



We use a Random-walk Metropolis-Hastings proposal to update c. We sample c' 
0^(0, .) = N{c, cr^) then accept the proposed value with acceptance probability 

. /^ . /. vr(c'|z,7,W,y) 
a{c, c ) = mm < 1, 



tt{c\z, 7, W, Y) 

The tuning parameter cr^ is also obtained via the algorithm in Appendix B. 

Note that when the sample size n is large, the number of parameters in the Update W 
step becomes large and correspondingly manual tuning of the scale parameters af in the 
Gaussian Random- Walk Metropolis-Hastings sampler becomes infeasible. One strategy 
to automate the sampler is to use a slice sampler (see INeal 20031) . But the additional 
evaluations of the posterior function makes this algorithm much more computationally 



intensive. In this article we use the algorithm of Garthwaite et al. (2010) that automati- 
cally tunes the scaling parameters af and obtains an optimal over all acceptance rate of 
p* = 0.44 ([Roberts and Rosenthal 2001 |l for these univariate updates. See Appendix B for 
a description of the algorithm used for tuning. 
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Once a converged MCMC sample {{z^^\j^^\ W^^\ c(*^)}t=i,...,T is obtained it is pos- 
sible to estimate the curve fp{x) by a Bayesian model averaging approach (BMA). The 
posterior expectation for /3 given z, 7, W and c is 

n^,,^\z,7,w,Y,c) = ^(x;^i^-ix,,^)-ix;^VF-iy(H/). (10) 

Thus an estimate for fp{x) can be obtained by 

1 ^ r(*) 

Another possibility to estimate the curve fp{x) is to use the maximum a posteriori (MAP) 
estimate for (z, 7, VF, c) 

{z,j,W,c) = argmax7r(zW,7W,l^W,cW|y), 

l<t<T 

then calculate the corresponding curve estimate via 



3 Examples 

3.1 Simulation studies 

We carry out simulations to compare the use of the method described in this paper with 



the method COBS proposed by He and Ng (1999) COBS estimates both constrained and 



unconstrained quantile curves using B-spline smoothing and is available as an R pack- 
age. Here we use the unconstrained case as a fully automated procedure, where both 
the smoothing parameter and the selection of knots is carried out according to either the 
AIC or the BIC criterion. 

We consider simulated datasets that correspond to the three examples described bel- 
low, these examples are adapted from some well known examples in the curve fitting lit- 
erature, see e.g. [Smith and Kohn (1996)]|Denison et al. (1998) | and [DiMatteo et al. (2001) 



Example 1: Here the curve takes the form 

/(x) = (/>(x, 0.15, O.O52)/4 + 0(x, 0.6, 0.22)/4, x G [0,1], 

where cl){x,fi,a^) denotes the value at x of the normal density with mean fi and 
variance cr^. n = 200 data points x are sampled from the Uniform distribution 
C/(0, 1). The noise e is added to the data, they corresponds to a Gamma distribution 
Qa{l,A) with shape parameter 1 and rate parameter 4 that is translated by -0.175 
(so that the median of this noise distribution is approximatively 0). 

Example 2: Here the curve takes the form 

/(x) = sin(2x) + 2exp(-162;2), xe [-2,2]. 

and is evaluated at n = 201 regularly spaced grid points. This function is first 
rescaled so that the support is on the unit interval. The noise e added to the data is 
simulated in the same way as in the first example. 
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Example 3: In this example the curve is given by, 

f{x) = sin(x) + 2exp(-30x2), x G [-2,2], 

and the data points x correspond to n = 201 regularly spaced grid points. As in the 
previous example the function is rescaled on the unit interval for x and the same 
distribution for noise e is used for the data. 

To compare the different methods we use the mean squared error (MSE) as a measure of 
goodness of fit, given by 

1 " 

MSE=-V{/(x,)-/(xO}2 



n 



=1 



where / is the true median regression function and / is the estimated function. Since 
the COBS algorithm computes the median curve with quadratic (or linear) splines we 
consider hereafter the case P = 2. 

For the prior specifications of each example, we set A = 3 and L = 10 for the trun- 
cated Poisson prior. Results are largely insensitive to values of A around this range and 
the maximum number of knots allowed L is chosen to be large enough to not affect 
the simulation results here. For these examples we consider the situation where there 
is no prior information on the knot locations and chose the intervals 1^ to correspond 
to the ranges given by every sorted x values. We found that rix = 5 was sufficient 
to provide a good fit in each of the three examples. We use a B-spline basis to formu- 
late the Xz,^ matrix, as in DiMatteo et al. (2001)[ to avoid numerical instability (see e.g. 



Ruppertet al. 20031. 



The computation of all three examples started with an arbitrary set of initial values 
generated from the prior distributions. We first ran the algorithm 500 iterations for adap- 
tive tuning then fixing the scaling parameters of the Random- Walk Metropolis Hastings 
algorithm at the final value of the tuning run, we then ran a burn-in of 500 iterations, 
followed by 1,500 recorded iterations. Each iteration involves an update of 20 z update 
steps for each 7 update step. To assess convergence, we monitored the trace plots of each 
model parameters as well as posterior values. We also ran much longer chains of 10,000 
iterations and found the results to be similar in terms of MSE calculations. See Figure [T] 
for the fitted functions of the three examples using our method with the BMA estimate 
and with the MAP estimate. 

For each of the three examples the BMA, MAP and COBS (with the AIC or the BIC 
criterion) estimates are calculated over 50 randomly generated datasets. The mean and 
standard deviation of the MSEs are presented in Table [TJ the corresponding boxplots 
are given in Figure |2] On the whole the method presented in this paper performs well 
compared to COBS, especially on the datasets corresponding to Example 3. On the three 
types of datasets that are considered here, the BMA estimates seem to be more accurate 
than the MAP estimates. 



3.2 Motorcycle data set 

We consider a reference dataset, the motorcycle data, studied in the context of quantile 
regression for example in Koenker (2005) or in IChen and Yu (2009) | These data are ana- 
lyzed in Silverman (1985) and contain experimental measurements of the acceleration of 
the head of a test dummy (expressed in g, acceleration due to gravity) as a function of 
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liMA 


A /T A O 
MAI 


AIC 


BIC 


Example 1 


n nnoo 
U.UUoZ 


O.OOoo 


O.OOOZ 


n nn^tx 
0.00 


(0.0017) 


(0.0021) 


(0.0033) 


(0.0069) 


Example 2 


U.UU4U 


U.UUo 1 


n nnfin 
U.UUoU 




(0.0025) 


(0.0038) 


(0.0029) 


(0.0026) 


Example 3 


0.0036 
(0.0018) 


0.0056 
(0.0025) 


0.0084 
(0.0028) 


0.0139 
(0.0044) 



Table 1: Mean MSEs with estimated standard errors in brackets based on 50 samples obtained using 
the Bayesian model averaging (BMA), the maximum a posteriori (MAP) and the COBS algorithm 
(with the AIC criterion and with the BIC criterion). 



time in the first moments after an impact (the time is expressed in ms). The dataset is 
challenging for quantile regression as the the values and the variability of the response 
vary dramatically with the independent variable. 

We fit to these data the quantile regression curves corresponding to p = 0.25,0.5 
and 0.75. The prior settings are esentially the same as the ones already described in 
the simulation studies, except here we set A = 5 and L = 15 for the truncated Poisson 
prior. For the MCMC computation of the curves we started with an arbitrary set of initial 
values generated from the prior distributions. Again we used the first 500 iterations 
for adaptive tuning then we ran a burn-in of 500 iterations followed by 3,500 recorded 
iterations, where each iteration involves an update of 20 z update steps for each 7 update 
step. 

We give in Figure|3]the quantile curves corresponding to linear splines P = 1. The re- 
sults appear quite satisfactory as the quantile curves are not crossing each other, even in 
the region beyond 50 millisecond where the data are sparse. The changes in the variabil- 
ity of the acceleration over time has been captured well by the fitted conditional quantile 
curves, as they are very close to each other for the first few milliseconds then diverge 
after the crash. 



4 Quantile regression for additive models 
4.1 Introduction 

When several potential predictors for the response variable are of interest, a standard 
procedure to avoid the so-called "curse of dimensionality" is to use an additive model 
(|Hastie and Tibshirani 1990]l where the response is modeled as a sum of functions of the 
predictors. In the context of quantile regression, if Y denotes the real-valued response 
variable and if now X = (X^, X'^) denotes a vector of d predictors, the p-th quantile 
of the conditional distribution of Y given {X = x} is modeled as 

d 

/p(x) = 5]/^(x^). (11) 
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See Yu and Lu (2004) for an inference on the additive quantile regression model by a 



kernel-weighted local linear fitting and see Yue and Rue (2011) for a Bayesian inference 



with either a MCMC algorithm or using INLA. 

If we use spline functions to model the different curves /p(x^), ...,fp{x'^) it is still 
possible to use the linear model (O with the difference that the design matrix is now 
made up of the columns of the individual design matrices corresponding to with a 
single intercept term for identifiability. Thus inference on the additive quantile regres- 
sion model can be performed via the same methodology and algorithm described in the 
previous sections. We consider below the study of a real dataset that involves additive 
quantile regression. 

4.2 Analysis of the Boston housing dataset 

We revisit the so-called Boston house price data available in the R package MASS. This 
dataset has been originally studied in [Harrison and Rubinfeld (1978) The full dataset 



consists of the median value of owner-occupied homes in 506 census tracts in the Boston 
Standard Metropolitan Statistical Area in 1970 along with 13 various sociodemographic 



variables. This dataset has been analyzed in many statistical papers including Opsomer and Ruppert (1998 



who used an additive model for mean regression, and Yu and Lu (2004)] who proposed 



an additive quantile regression model by a kernel-weighted local linear fitting. As in 
these two references we consider the median values of the owner-occupied homes (in 
$1000s) as the dependent variable and four covariates given by 

RM = average number of rooms per house in the area, 

TAX = full property tax rate ($/$10,000), 

PTRATIO = pupil/ teacher ratio by town school distric, 

LSTAT = the percentage of the population having lower economic status in the area. 



As noticed in Yu and Lu (2004) these data are suitable for a quantile regression analysis 
since the response is a median price in a given area and the variables RM and LSTAT are 
highly skewed. More precisely we consider the additive model where the p-th quantile 
of the conditional distribution of the response is given by 

/p(x) = ao + /p^RM) + /2(log(TAX)) + /3(PTRATIO) + /p4(log(LSTAT)). 

We fit to these data the p-th quantile regression curves corresponding to cubic splines 
(P = 3) at the quantile levels p = 0.25, 0.5 and 0.75. For the prior settings we took A = 5 
and L = 8 for the truncated Poisson prior. For each predictor we set the intervals Ik to be 
10 equally sized partition sets over the range of the variable. Excluding the possibility of 
knots in the first and the last intervals, we get Kmax = 8 for each variable. For the MCMC 
computation of the curves we started with a random set of initial values generated from 
the prior distributions. We first ran the algorithm 500 iterations for adaptive tuning then 
we ran a burn-in of 500 iterations, followed by 4,000 recorded iterations, where each 
iteration involves an update of 20 z update steps for each 7 update step. We present in 
Figure m the different estimated curves. We plotted on the same graphs the datapoints 
corresponding to the original data minus the effect of all the other variables and the 
constant term. The fact that the values of log (TAX) are not well dispersed over their 
range and the presence of a few outliers in the dataset did not seem to be a problem for 
our method. 
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Our results appear consistent with the results provided in the quoted previous anal- 
yses. Briefly, the variables RM and LSTAT appear as the most important covariates. If 
the contribution of LSTAT look similar for the three quantiles levels, the contribution of 
RM looks slightly more important for the upper quantile level p = 0.75. The variable 
TAX has a contribution relatively more important for the lower quantile level p = 0.25. 
Finally the Figure H] suggests a linear contribution of the variable PTRATIO, especially 
for p = 0.5 and for p = 0.75. 



5 Noncrossing quantile regression curves 

One known problem when using quantile regression for multiple percentiles is that the 
quantile curves that are estimated separately can cross, which is impossible. See for 
example the Figure |5] (a) where, partly due to the relatively small size of the dataset and 
the complex conditional distribution of the response variable, the two estimated quantile 
curves for p = 0.2 and p = 0.3 are crossing around the value x = 0.6. 

The treatment of noncrossing quantile regression curves is difficult and several at- 
tempts to circumvent this problem have been proposed in different settings, see e.g. the 
references in Yu et al. (2003)| and in Koenker (2005) or, for a more recent development in 



this area, see e.g. [Reich et al. (2011) In particular, Bondell et al. (20l0)] proposed a so- 
lution to this problem by considering a generalization of the criterion ^ to the case of 
simultaneous inference on several quantile curves. For clarity we suppose hereafter that 
we are interested in the fitting of two quantile curves corresponding to quantile levels pi 



and p2, with pi < P2- [Bondell et al. (20l0)| gave a solution to the minimization under the 



constraint < /pjC-) of the expression 

^Pp.iVi- fpjixi))^ (12) 
plus a penalty term corresponding to smoothing. An alternative approach described in 




Dunson and Taylor (2005) uses a so-called "substitution likelihood" that does not corre- 
spond to the distribution of the data given the unknown curves but yields a valid uncer- 
tainty. The substitution likelihood that they considered corresponds to the multinomial 
weights 

where ui represents the number of datapoints below the curve /i, where U2 represents 
the number of datapoints between the two curves and where us is the number of data- 
points above the curve /2. They gave conditions on the prior for the "pseudo-posterior" 
7r(/pj, /pjly) tx ^(/pi, /p2l^)'^(/pn fp2) to be proper and proposed a MCMC algorithm for 
(pseudo-)posterior computation in the case of linear quantiles. 

Here we propose a new method to postprocess the MCMC samples obtained from 
separate quantile regression curve fitting. We denote by dp = {Pz,'y, <7, z, 7, W, c) the full 
set of unknown parameters for the p-th quantile regression curve model 0. Let 9p-^, 6p^, 
be the parameters corresponding to the quantile regression curve for the quantile levels 
Pi and p2 respectively, pi < P2- We consider a new substitution likelihood of the form 

s{ep,,ep,\Y) = L{ep^Y)L{ep,\Y)Iy^^^,\e,,)<f,,i.\e,,)} (14) 
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where L{9p-^\Y) and L{9p2\Y) denotes the two likelihood functions for quantile levels pi 
and p2 given by the conditional distribution ((T)). The indicator function takes the value 
one if fpi{x\Op^) < fp^{x\6p2) for all x and zero otherwise, here the function fp{x\6p) is 
evaluated according to Equation ^ with parameters 6p. It is not hard to see that the 
maximizer of this substitution likelihood is the maximizer of ST2^ . Moreover, if we take 
independent priors ir^Op^ ) and vr(0p2 ) on the two sets of parameters, then the correspond- 
ing quasi-posterior is simply 

i^pi ) ^P2 pi)'^{dp2), 

OC ir{9p-^\Y)ir{9p2 \^)I{fp-^{x\ep-^)<fp2(x\ep,^)}- 

(15) 

Given samples from the distribution 7r{9p^\Y) ® Tr{9p2\Y) an importance sampling argu- 
ment can be used to reweight the samples according to this quasi-posterior. 

In practice, MCMC samples obtained from separate posterior explorations of 7r(0p^ \Y) 
and of vr(0p2 \Y) can be combined to form the new estimate of the the curve fp^ (x) by 

When the constraint above excludes too many samples this estimator will be unreliable, 
in this case more MCMC samples will be required. A computationally cheap way to 
obtain more samples is to consider all combinations of the two MCMC samples. 

Figure |5] (b) shows the corrected curves from the estimator in ((16)) , using all the pos- 
sible combinations of the two MCMC samples, each of size 2,000. To evaluate the curves 
we use here the plug-in estimator ((T0)| for P^.^y- Finally the constraint on the curves is 
checked at every observed values of x. 

The strength of the above approach is that it is very easy to apply, and can be used on 
any posterior samples from separate quantile curves. An obvious draw back is that in 
some cases, when for example pi and p2 are very close, the number of samples satisfying 
the constraint can be extremely low. 



6 Conclusion 

In this article, we have provided a procedure for Bayesian inference on quantile curve 
fitting. We focused on the use of regression splines with unknown number of knots and 
location to obtain smooth curves. We have seen that, within an auxiliary variable frame- 
work, a scale mixture of normals representation for the asymmetric Laplace distribution 
together with appropriate prior specifications makes it possible to integrate out the re- 
gression and the variance parameters analytically. This facilitates a simple Metropolis- 
Hastings within Gibbs sampler for simulation from the posterior distribution of interest. 
The proposed algorithm is fully automated with the inclusion of an automatic tuning 
step, which optimally tunes the Random-Walk Metropolis-Hastings scaling parameters. 
We have shown that our method performs well on several types of datasets. We have 
also shown that the proposed framework can be trivially extended to inference on addi- 
tive models. Finally we have proposed and discussed a simple and general procedure 
that postprocesses MCMC samples to obtain noncrossing quantile regression curves. 
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Appendix A 



The marginal posterior 

The full joint posterior distribution of the parameters is 

vr(/32,7, z, cj, 7, W, c\Y) oc f{Y\X^^^, /S^,^, z, a, 7, iy)7r(iy|cj)7r(/3^,^ , z, a, 7|iy)7r(c), 

oc /S^,^, z, a, 7, VF)7r(VF|o-)7r/3^_^ (/?2,7k, 7> ^) 

X7r^(a)7r2(z)7r^(7)7r(c). 

With/(y|X,,^,/3,^ ,7, z, cr, 7, ly) and -Kp.^^^ {fiza\^, cr, 1, W) given by the two Gaussian dis- 
tributions (O and ^ the parameter Pz,'^ is easily integrated out using classical results of 
Bayesian linear models. We get 

Vc + iy \M=iWi \(tJ 

where 
with 

Then the parameter a can be also integrated out and we get 

|z| + P + l / \ ^ 

<^,1,WAY) oc ^illllllf—^ 1 ) vr(c). 



Appendix B 

Automatic tuning algorithm 

Here we provide the algorithm to optimally search for the tuning parameters af,i = 
1, . . . , n, and crj. The algorithm runs within the main MCMC algorithm given in Section 
12.21 Tuning will only apply to the Update W and Update c steps. For the update of each 
of the parameters Wi,i = 1, . . . ,n and c do: 



Initialisation: For the iteration f = 1 of the algorithm initialise the scaling parameter 
cr* = (T^ = I, where corresponds to cij and ex* when updating the parameters Wi, 
i = 1, . . . ,n and c respectively. Set p* = 0.44 and initialise j = . The value of p* corre- 
sponds to the optimal acceptance probability for a univariate Random-Walk Metropolis- 
Hastings algorithm ([Roberts and Rosenthal 2001 1) . 

Tuning: Set j = j + 1; update the parameters according to either Update W or update c, 
and obtain the corresponding acceptance probability a as in Section |Z21 
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update scaling: if j < 20 set a^~^^ = cr*, else set 



a 



.t+i 



{ 



a^ + K{l-p*)/j ifU<a 
fj* - K.p*/j if [/ > a 



where k = o-7{p*(l - p*)} and U ~ U{0, 1). 

restart the algorithm: If t < 100, and either (7*+^ > 3a* or (T*+^ < a* /3, restart the 
algorithm, setting a* = 0"*+^ and j = 0. Note we do not restart the algorithm again if the 
total number of restarts exceeds 5. 

Increment loop: Set t = t + 1. Go back to the beginning unless t exceeds some pre- 
specified number of iterations riTune. 

It is easy to monitor the changes in cr* in order to determine the number of tuning it- 
erations riTune to achieve stability. In practice, we run the first nTune iterations of the 
algorithm in Section |Z2] with automatic tuning, and then start the main part of MCMC 
as usual with the scaling parameters fixed at the value (j"-'^""'^. 
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(a) Example 1 
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(b) Example 2 
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(c) Example 3 

Figure 1: Estimated curves for the three simulated examples. 
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(a) Example 1 



(b) Example 2 
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(c) Example 3 

Figure 2: Boxplots for the MSEs corresponding to the three simulated examples. 
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Figure 3: Motor cycle data set ; estimated quartile regression curves by BA approach for P = 1 
(p=0.5: solid line ; p=0.25 and p=0.75: dotted lines) 
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Figure 4: Boston housing dataset ; fitted quantile curves p = 0.25,0.5,0.75, P = 3, for the four 
variables that have been considered. One each figure the datapoints represented correspond to the 
original data minus the effect of all the other variables (and the constant term). 
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(a) Initially estimated curves 




(b) Corrected curves 

Figure 5: Curve crossing example. The dotted lines represent the true quantile curves for p = 0.2 
and p = 0.3. The solid lines represent (a) the quantiles curves that have been estimated separately (b) 
the corrected estimated quantile curves with respect to the new substitution likelihood. 
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